Contrastive Language-Image Pre-trained (CLIP) models can classify an image as belonging to "[CLASS]" in a zero-shot manner by using the similarity between the image and the prompt sentence "a [CONTEXT] of [CLASS]". Thanks to the exhaustive text cues in "[CONTEXT]", the CLIP model is aware of different contexts, e.g. background, style, and viewpoint, and exhibits unprecedented robustness against a wide range of distribution shifts. However, recent works find that further fine-tuning of CLIP models improves accuracy but sacrifices robustness on downstream tasks. We conduct an empirical investigation showing that fine-tuning corrupts the context-aware ability of pre-trained CLIP features. To solve this problem, we propose Context-Aware Robust Fine-tuning (CAR-FT). CAR-FT regularizes the model during fine-tuning so that it keeps capturing context information. Specifically, we use zero-shot prompt weights to obtain the context distribution contained in the image. By minimizing the Kullback-Leibler Divergence (KLD) between the context distributions induced by the original and fine-tuned CLIP models, CAR-FT lets downstream tasks inherit the context-aware ability of CLIP, and achieves both higher In-Distribution (ID) and Out-Of-Distribution (OOD) accuracy. The experimental results show that CAR-FT achieves superior robustness on five OOD test datasets of ImageNet, while also bringing accuracy gains on nine downstream tasks. Additionally, CAR-FT surpasses previous Domain Generalization (DG) methods and reaches 78.5% average accuracy on the DomainBed benchmark, establishing a new state of the art.
Translated by Google Translate
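The context-distribution regularizer described in the CAR-FT abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy two-dimensional features, the three hypothetical context prompt weights, the `temperature` parameter, and the KL direction (original model first); the abstract does not specify any of these.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def context_distribution(image_feature, context_weights, temperature=1.0):
    """Softmax over similarities between an image feature and context prompt weights."""
    logits = [
        sum(f * w for f, w in zip(image_feature, row)) / temperature
        for row in context_weights
    ]
    return softmax(logits)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical prompt weights for three contexts (e.g. "photo", "sketch", "painting").
context_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Toy features of the same image under the original and the fine-tuned encoder.
feature_original = [1.0, 0.0]
feature_finetuned = [0.6, 0.8]

p_original = context_distribution(feature_original, context_weights)
p_finetuned = context_distribution(feature_finetuned, context_weights)

# Regularization term that would be added to the fine-tuning loss in this sketch.
context_loss = kl_divergence(p_original, p_finetuned)
```

In this toy setup the loss is zero when the fine-tuned features induce the same context distribution as the original ones, and grows as the two distributions drift apart.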
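Recent studies have shown that detectors based on deep models are vulnerable to adversarial examples, even in the black-box scenario where the attacker cannot access the model information. Most existing attack methods aim to minimize the true positive rate, which often yields poor attack performance, because another bounding box around the attacked one can be detected and become the new true positive. To address this challenge, we propose to minimize the true positive rate and maximize the false positive rate at the same time, which encourages more false positive objects to block the generation of new true positive bounding boxes. The task is modeled as a multi-objective optimization (MOP) problem, for which a genetic algorithm can search for Pareto-optimal solutions. However, our task has more than two million decision variables, which leads to low search efficiency. We therefore extend the standard Genetic Algorithm with Random Subset selection and Divide-and-Conquer, called GARSDC, which significantly improves efficiency. Moreover, to alleviate the sensitivity of genetic algorithms to population quality, we generate a gradient-prior initial population by exploiting the transferability between different detectors with similar backbones. Compared with state-of-the-art attack methods, GARSDC reduces mAP by an average of 12.0 and the number of queries by about 1000 times in extensive experiments. Our code can be found at https://github.com/liangsiyuan21/garsdc.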
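The two-objective formulation above (lower true positive rate, higher false positive rate) can be illustrated with a small Pareto-dominance sketch. This is not the GARSDC algorithm itself; the candidate tuples and the function names `dominates` and `pareto_front` are hypothetical and only show what "Pareto-optimal" means for these two objectives.

```python
def dominates(a, b):
    """Return True if candidate a dominates candidate b.

    Each candidate is (true_positive_rate, false_positive_rate).
    From the attacker's side, lower TPR and higher FPR are both better.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better

def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]

# Hypothetical (TPR, FPR) outcomes of five candidate perturbations.
candidates = [(0.9, 0.1), (0.5, 0.3), (0.6, 0.6), (0.5, 0.2), (0.3, 0.1)]
front = pareto_front(candidates)
print(front)  # [(0.5, 0.3), (0.6, 0.6), (0.3, 0.1)]
```

A genetic search would keep such a non-dominated set as its population of trade-offs rather than collapsing the two objectives into one score.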
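Currently, deep neural networks (DNNs) are widely adopted in different applications. Despite their commercial value, training a well-performing DNN is resource-consuming. Accordingly, a well-trained model is valuable intellectual property for its owner. However, recent studies have revealed the threat of model stealing, where adversaries can obtain a functionally similar copy of the victim model even when they can only query it. In this paper, we propose an effective and harmless model ownership verification (MOVE) to defend against different types of model stealing simultaneously, without introducing new security risks. In general, we conduct ownership verification by checking whether a suspicious model contains knowledge of defender-specified external features. Specifically, we embed the external features by modifying a few training samples with style transfer. We then train a meta-classifier to determine whether a model was stolen from the victim. This approach is inspired by the understanding that stolen models should contain knowledge of the features learned by the victim model. In particular, we develop our MOVE method under both white-box and black-box settings to provide comprehensive model protection. Extensive experiments on benchmark datasets verify the effectiveness of our method and its resistance to potential adaptive attacks. The code for reproducing the main experiments of our method is available at https://github.com/thuyimingli/move.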
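Fast adversarial training (FAT) effectively improves the efficiency of standard adversarial training (SAT). However, the original FAT suffers from catastrophic overfitting, i.e., the robust accuracy against adversarial attacks drops suddenly and dramatically. Although several FAT variants spare no effort to prevent overfitting, they do so at a considerable computational cost. In this paper, we explore the difference between the training processes of SAT and FAT and observe that the attack success rate of the adversarial examples (AEs) used by FAT gradually deteriorates in the late training stage, resulting in overfitting. The AEs are generated by the fast gradient sign method (FGSM) with a zero or random initialization. Based on this observation, and after investigating several initialization strategies, we propose a prior-guided FGSM initialization method to avoid overfitting, which improves the quality of the AEs throughout the whole training process. The initialization is formed by leveraging historically generated AEs, without additional computational cost. We further provide a theoretical analysis of the proposed initialization method. We also propose a simple yet effective regularizer based on the prior-guided initialization, i.e., the currently generated perturbation should not deviate too much from the prior-guided initialization. The regularizer uses both historical and current adversarial perturbations to guide model learning. Evaluations on four datasets demonstrate that the proposed method can prevent catastrophic overfitting and outperforms state-of-the-art FAT methods. The code is released at https://github.com/jiaxiaojunqaq/fgsm-pgi.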
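A single FGSM step with different initializations can be sketched as follows. This is a toy illustration under stated assumptions, not the released FGSM-PGI code: the gradient values, `eps`, and `alpha` are made up, the prior is simply taken to be the perturbation from a previous step, and `deviation_penalty` is a hypothetical squared-distance stand-in for the regularizer mentioned in the abstract.

```python
def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)

def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))

def fgsm_step(grad, init, eps, alpha):
    """One FGSM update: start from init, move along sign(grad), project to the eps-ball."""
    return [clip(d + alpha * sign(g), -eps, eps) for g, d in zip(grad, init)]

def deviation_penalty(delta, prior):
    """Hypothetical regularizer: squared distance between current and prior perturbation."""
    return sum((d - p) ** 2 for d, p in zip(delta, prior))

eps, alpha = 0.03, 0.02
grad = [0.5, -2.0, 0.0]  # toy loss gradient with respect to the input

# Zero initialization, as in the original FAT setting.
zero_init_delta = fgsm_step(grad, [0.0, 0.0, 0.0], eps, alpha)

# Prior-guided initialization: reuse the perturbation generated previously.
prior_delta = zero_init_delta
prior_init_delta = fgsm_step(grad, prior_delta, eps, alpha)

penalty = deviation_penalty(prior_init_delta, prior_delta)
```

Starting from the historical perturbation costs no extra gradient computation, which matches the abstract's claim that the prior comes for free; the penalty term then keeps the new perturbation close to that prior.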
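As a common security tool, visible watermarking has been widely applied to protect the copyright of digital images. However, recent works have shown that visible watermarks can be removed by DNNs without damaging their host images. Such watermark-removal techniques pose a great threat to image ownership. Inspired by the vulnerability of DNNs to adversarial perturbations, we propose a novel defense mechanism that uses adversarial machine learning for good. From the adversary's perspective, blind watermark-removal networks can be treated as our target models. We then optimize an imperceptible adversarial perturbation on the host images to proactively attack watermark-removal networks, which we call a Watermark Vaccine. Specifically, two types of vaccines are proposed. The Disrupting Watermark Vaccine (DWV) causes the host image to be ruined along with the watermark after passing through a watermark-removal network. In contrast, the Inerasable Watermark Vaccine (IWV) works in the opposite way, trying to keep the watermark unremoved and still noticeable. Extensive experiments demonstrate the effectiveness of our DWV/IWV in preventing watermark removal, especially across various watermark-removal networks.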
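Obtaining a well-trained model involves expensive data collection and training procedures, so the model is valuable intellectual property. Recent studies have revealed that adversaries can "steal" deployed models even when they have no training samples and cannot access the model parameters or structure. Some defense methods currently exist to alleviate this threat, mostly by increasing the cost of model stealing. In this paper, we explore the defense from another angle by verifying whether a suspicious model contains knowledge of defender-specified external features. Specifically, we embed the external features by modifying a few training samples with style transfer. We then train a meta-classifier to determine whether a model was stolen from the victim. This approach is inspired by the understanding that stolen models should contain knowledge of the features learned by the victim model. We examine our method on both the CIFAR-10 and ImageNet datasets. Experimental results demonstrate that our method is effective in detecting different types of model stealing simultaneously, even if the stolen model is obtained via a multi-stage stealing process. The code for reproducing the main results is available on GitHub (https://github.com/zlh-thu/StealingVerification).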
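Adversarial training (AT) has been shown to effectively improve model robustness by leveraging adversarial examples for training. However, most AT methods face expensive time and computational costs, because gradients are computed at multiple steps when generating adversarial examples. To boost training efficiency, fast AT methods adopt the fast gradient sign method (FGSM), which computes the gradient only once. Unfortunately, the resulting robustness is far from satisfactory. One reason may lie in the way the perturbation is initialized. Existing fast AT methods generally use a random, sample-agnostic initialization, which helps efficiency but hinders further robustness improvement. So far, initialization in fast AT has not been extensively explored. In this paper, we boost fast AT with a sample-dependent adversarial initialization, i.e., the output of a generative network conditioned on a benign image and its gradient information from the target network. As the generative network and the target network are optimized jointly during training, the former can adaptively generate an effective initialization with respect to the latter, which leads to gradually improved robustness. Experimental evaluations on four benchmark databases demonstrate the superiority of our proposed method over state-of-the-art fast AT methods, as well as robustness comparable to that of advanced multi-step AT methods. The code is released at https://github.com/jiaxiaojunqaq/fgsm-sdi.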
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
A recent study has shown a phenomenon called neural collapse, in which the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
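The simplex equiangular tight frame mentioned above has a standard closed form that is easy to check numerically. The sketch below builds it in the simplest setting, assuming the feature dimension equals the number of classes K and no extra rotation is applied; the function names are illustrative and this is not the paper's regularizer.

```python
import math

def simplex_etf(num_classes):
    """Rows are the K vertices of sqrt(K / (K - 1)) * (I - (1 / K) * ones)."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

num_classes = 4
vertices = simplex_etf(num_classes)

# Every vertex has unit norm ...
norms = [math.sqrt(dot(v, v)) for v in vertices]
# ... and every pair of distinct vertices has the same cosine, -1 / (K - 1).
pairwise = [
    dot(vertices[i], vertices[j])
    for i in range(num_classes)
    for j in range(i + 1, num_classes)
]
```

With K = 4 every pairwise cosine is -1/3, the most negative value that K unit vectors can share, which is what "equiangular and maximally separated" refers to.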
Witnessing the impressive achievements of pre-training techniques on large-scale data in computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit to mitigate the sample inefficiency problem in visuomotor driving. Given the highly dynamic and variable nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains a large amount of information irrelevant to decision making, which makes the predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework designed for policy pre-training in visuomotor driving. We aim to learn policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns a driving policy representation by predicting the future ego-motion and optimizing the photometric error based on the current visual observation only. As a result, the pre-trained visual encoder is equipped with rich driving-policy-related representations and is therefore competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios demonstrate the superiority of our proposed approach, with improvements ranging from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.